Datastore: implement barrier if we see "in failed tx" error. #1418

Merged

branlwyd merged 1 commit into main from bran/op-group on May 27, 2023

Conversation

branlwyd (Contributor)

For some error modes, Postgres returns an error to the caller and then fails all subsequent statements within the same transaction with an "in failed SQL transaction" error. In effect, one statement receives the "root cause" error and every later statement receives an "in failed SQL transaction" error. In a pipelined scenario, if our code processes the results of these statements concurrently -- e.g. because they are part of a `try_join!`/`try_join_all` group -- we might receive and handle one of the "in failed SQL transaction" errors before we handle the "root cause" error, which can cause the "root cause" error's future to be cancelled before we ever evaluate it. If the "root cause" error would have triggered a retry, we end up skipping a DB-based retry when one was warranted.
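
For illustration, here is a minimal, self-contained sketch of the race (not Janus code; the error enum, function names, and timings are contrived to make the cancellation deterministic). The follow-on statement's error resolves first, so `try_join!` returns it and drops the future that would have yielded the retryable root cause:

```rust
// Requires: tokio = { version = "1", features = ["full"] }
use std::time::Duration;

#[derive(Debug)]
enum DbError {
    SerializationFailure,   // hypothetical "root cause" error that warrants a retry
    InFailedSqlTransaction, // follow-on error from the poisoned transaction
}

// Simulated statement whose (retryable) error takes longer to surface.
async fn root_cause_stmt() -> Result<(), DbError> {
    tokio::time::sleep(Duration::from_millis(50)).await;
    Err(DbError::SerializationFailure)
}

// Simulated later statement that fails fast with the generic error.
async fn follow_on_stmt() -> Result<(), DbError> {
    tokio::time::sleep(Duration::from_millis(10)).await;
    Err(DbError::InFailedSqlTransaction)
}

#[tokio::main]
async fn main() {
    // try_join! returns on the first error and drops the remaining futures,
    // so we observe InFailedSqlTransaction and never see SerializationFailure.
    let result = tokio::try_join!(root_cause_stmt(), follow_on_stmt());
    println!("{result:?}"); // Err(InFailedSqlTransaction)
}
```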

To fix this problem, we (internally) wrap all direct DB operations in `run_op`. This function groups concurrent database operations into "operation groups", which allow us to wait for all operations in the group to complete (this waiting is called "draining"). If we ever observe an "in failed SQL transaction" error, we drain the operation group before returning. Under the assumption that the "root cause" error is concurrent with the "in failed SQL transaction" errors, this guarantees that we evaluate the "root cause" error for retryability before any error escapes the transaction code.
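
As a rough illustration of the approach (not the actual `run_op` implementation; `DbError`, `OpGroup`, and the retryable-error classification below are assumptions for the sketch), an operation group can be modeled as an in-flight counter plus a flag recording whether any concurrent operation hit a retryable error:

```rust
// Requires: tokio = { version = "1", features = ["full"] }
use std::sync::atomic::{AtomicBool, Ordering};
use tokio::sync::watch;

// Hypothetical error classification, mirroring the sketch above.
#[derive(Debug)]
enum DbError {
    SerializationFailure,   // retryable "root cause" error
    InFailedSqlTransaction, // follow-on error from the poisoned transaction
}

struct OpGroup {
    in_flight: watch::Sender<usize>, // count of operations currently running
    saw_retryable_error: AtomicBool, // set if any operation hit a retryable error
}

impl OpGroup {
    fn new() -> Self {
        Self {
            in_flight: watch::channel(0).0,
            saw_retryable_error: AtomicBool::new(false),
        }
    }

    // Hypothetical run_op wrapper: runs one DB operation as part of the group.
    async fn run_op<T>(
        &self,
        op: impl std::future::Future<Output = Result<T, DbError>>,
    ) -> Result<T, DbError> {
        // Register this operation with the group, run it, then deregister,
        // recording whether it produced a retryable "root cause" error.
        self.in_flight.send_modify(|n| *n += 1);
        let result = op.await;
        if matches!(&result, Err(DbError::SerializationFailure)) {
            self.saw_retryable_error.store(true, Ordering::SeqCst);
        }
        self.in_flight.send_modify(|n| *n -= 1);

        if matches!(&result, Err(DbError::InFailedSqlTransaction)) {
            // Drain: wait until every other operation in the group has
            // completed, so any concurrent root-cause error has been recorded.
            let mut rx = self.in_flight.subscribe();
            let _ = rx.wait_for(|&n| n == 0).await;
            if self.saw_retryable_error.load(Ordering::SeqCst) {
                // Surface the retryable root cause rather than the follow-on
                // "in failed SQL transaction" error.
                return Err(DbError::SerializationFailure);
            }
        }
        result
    }
}
```

With this shape, each of the futures passed to `try_join!` would go through `run_op` on a shared `OpGroup`, so whichever error reaches the caller first has already waited for (and absorbed) any concurrent retryable error.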

Closes #1417.

branlwyd requested a review from a team as a code owner on May 26, 2023 at 22:04

divergentdave (Collaborator) left a comment:

Very nice!

branlwyd (Contributor, author) commented:

(I'm running my test_until_failure.sh script against the janus_daphne test -- it took a few hours to observe a failure on my workstation last time, so I'm going to leave it running for a while before merging.)

inahga (Contributor) left a comment:

Appreciate the detailed explanations!

branlwyd merged commit 7ad8faa into main on May 27, 2023
branlwyd deleted the bran/op-group branch on May 27, 2023 at 00:33

Successfully merging this pull request may close these issues:

Future cancellation impacts decision to retry transactions (#1417)